Data 621 Homework 4

Introduction

In this assignment, we will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is binary. A “1” indicates that the customer was in a car crash while 0 indicates that they were not. The second response variable is TARGET_AMT. This value is 0 if the customer did not crash their car. However, if they did crash their car, this number will be a value greater than 0.

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict whether a customer will crash their car and to predict the cost in the case of crash. We will only use the variables given to us (or variables that we derive from the variables provided).

Below is a short description of the variables of interest in the data set:

Data Exploration

Data Exploration

The dataset consists of 26 variables and 8161 observations with AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Also, 13 of the variables have discrete values and the rest of the variables are continuous.

Data summary
Name data_train
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
factor 14
numeric 12
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
INCOME 0 1 FALSE 6613 $0: 615, emp: 445, $26: 4, $48: 4
PARENT1 0 1 FALSE 2 No: 7084, Yes: 1077
HOME_VAL 0 1 FALSE 5107 $0: 2294, emp: 464, $11: 3, $11: 3
MSTATUS 0 1 FALSE 2 Yes: 4894, z_N: 3267
SEX 0 1 FALSE 2 z_F: 4375, M: 3786
EDUCATION 0 1 FALSE 5 z_H: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 0 1 FALSE 9 z_B: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_USE 0 1 FALSE 2 Pri: 5132, Com: 3029
BLUEBOOK 0 1 FALSE 2789 $1,: 157, $6,: 34, $5,: 33, $6,: 33
CAR_TYPE 0 1 FALSE 6 z_S: 2294, Min: 2145, Pic: 1389, Spo: 907
RED_CAR 0 1 FALSE 2 no: 5783, yes: 2378
OLDCLAIM 0 1 FALSE 2857 $0: 5009, $1,: 4, $1,: 4, $4,: 4
REVOKED 0 1 FALSE 2 No: 7161, Yes: 1000
URBANICITY 0 1 FALSE 2 Hig: 6492, z_H: 1669

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0 ▇▇▇▇▇
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0 ▇▁▁▁▁
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0 ▁▆▇▂▁
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0 ▇▂▁▁▁
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0 ▂▃▇▃▁
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0 ▇▂▁▁▁
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0 ▆▇▇▃▁

Data Processing


Fix data types

We noticed that a few variables that are listed as discrete have large numbers of unique values. A closer inspection of the variable descriptions reveals that that while these variables are encoded as factors they are actually continuous. The TARGET_FLAG variable also appears in the summary as numeric variable, but it should be a binary factor. We proceed to fix these data types.

Fix bad and missing values

Also, there are some values that seem invalid (i.e. -3 CAR_AGE). Since both variables the missing values are less than 5% then we can replace the missing values with the median. We Will take the median on the training set only and impute in both training and testing to avoid overfitting.

Univariate charts

We now explore the distribution of TARGET_FLAG across the numeric variables. We see that BLUEBOOK, INCOME, OLDCLAIM have a high number of outliers compared to other variables. We also see that customers with who are older, or have older cars, higher home values, higher income tend to get into fewer car crashes. However, people with motor vehicle record points or high number of old claims tend to get into more accidents.

The variables dislayed below need scale transformations like OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL. AGEhas a guassian distribution. We see several variables have high number of zeros. AGE is the only variable that is normally distributed. Rest of the variables show some skewness. We will perform Box-Cox transformation on these variables.

Correlation

We see MVR_PTS, CLM_FREQ, and OLDCLAIM are the most positively correlated variables with our response variables. Whereas, URBANICITY is the most negatively correlated variable. Rest of the variables are weakly correlated.

Centrality Measures and Outliers

As was previously noted this distribution has a long tail. The mean payout is $5616 and the median is $4102. The median and mean are higher, of course for those observations we classified as outliers. The outlier cutoff point is $10594.


Data Preparation

Sampling


   0    1 
6008 2153 

There is an imbalance in the TARGET_FLAG variable

Let’s check the class distribution


        0         1 
0.7361843 0.2638157 

The data contains only 26% that has already did an accident and 74% of negative flag. This is severly imbalanced data set. This would affect the accuracy score in the model building step if untreated.

To treat this unbalance, we would use the over sampling

check the balance again


   0    1 
6008 6008 

Influential Leverage Points


Model Building

# A function to extract the relevant metrics from the summary and confusion matrix
build_model <- function(id, formula, data) {
  glm.fit <- glm(formula, data=data, family=binomial)
  print(summary(glm.fit))
  glm.probs <- predict(glm.fit, type="response")
  # Confirm the 0.5 threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$target, pred=glm.pred)
  results <- results %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  
  #print(confusionMatrix(results$pred.class,results$target.class, positive = "1"))
  
  acc <- confusionMatrix(results$pred.class,results$target.class, positive = "1")$overall['Accuracy']
  sens <- confusionMatrix(results$pred.class,results$target.class, positive = "1")$byClass['Sensitivity']
  spec <- confusionMatrix(results$pred.class,results$target.class, positive = "1")$byClass['Specificity']
  prec <- confusionMatrix(results$pred.class,results$target.class, positive = "1")$byClass['Precision']
  res.deviance <- glm.fit$deviance
  null.deviance <- glm.fit$null.deviance  
  aic <- glm.fit$aic
  metrics <- list(res.deviance=res.deviance, null.deviance=null.deviance,aic=aic, accuracy=acc, sensitivity=sens, specificity=spec, precision=prec)
  metrics <- lapply(metrics, round, 3)
  
  plot(roc(results$target.class,glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, formula=formula, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance, 
                         aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity, specificity=metrics$specificity, precision=metrics$precision)
  model.list <- list(model=glm.fit, df_info=model.df)
  return(model.list)
}

Model 1:

Model 2:

Model 3:

Model Selection